There is a growing abundance of unstructured data and obvious limitations in out of the box platforms we use to acquire and analyze data. There is a continously growing need to perform more timely and in-depth analysis to maximize potential insight. Achieving this requires larger datasets and a deeper understanding of how current tools technically function to conduct natural language processing and other analysis.
Are unwanted tweets streaming in because of vague keywords, is the script timing out, etc?
A linguistics professor was lecturing to her class one day. "In English," she said, "A double negative forms a positive. In some languages, though, such as Russian, a double negative is still a negative. However, there is no language wherein a double positive can form a negative."
A voice from the back of the room piped up, "Yeah . . .right."
import time
import json
from getpass import getpass
import tweepy
import MySQLdb
import unicodedata, re
from guess_language import guess_language
from datetime import datetime
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(map(unichr, range(0,32) + range(127,160)))
control_char_re = re.compile('[%s]' % re.escape(control_chars))
class StreamListener(tweepy.StreamListener):
def remove_control_chars(self,s):
return control_char_re.sub('',s)
def on_status(self, status):
db = MySQLdb.connect("localhost", "root", "password1", "twitdump")
coord = ''
if status.coordinates != None:
coord = status.coordinates['coordinates']
cursor = db.cursor()
try:
cleantweet = self.remove_control_chars(status.text)
cleantweet = cleantweet.encode('latin-1', 'replace')
language = guess_language(cleantweet)
#print 'TWEET[' + status.id_str + '] ' + cleantweet
#print 'about to insert... '
cursor.execute("""INSERT INTO tweets(status_id, tweet, created_at, retweet_count, coordinates, status_count, favorite_count, listed_count, utc_offset, description, location, followers_count, screen_name, friends_count, timezone, language) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
(status.id_str, cleantweet, status.created_at,
status.retweet_count, coord, status.user.statuses_count,
status.user.favourites_count, status.user.listed_count,
status.user.utc_offset, status.user.description,
status.user.location, status.user.followers_count,
status.user.name, status.user.friends_count, status.user.time_zone, language))
#print 'entity is is: ' + status.id_str
for tag in status.entities['hashtags']:
#print 'tag: ' + tag['text']
cursor.execute("""INSERT INTO hashtags(status_id, hashtag) VALUES (%s, %s)""",
(status.id_str, tag['text']))
for urlitem in status.entities['urls']:
#print 'urls: ' + urlitem['expanded_url']
#print 'shorturl: ' + urlitem['url']
fullurl = urlitem['expanded_url']
if len(fullurl) > 255:
fullurl = fullurl[0:255]
cursor.execute("""INSERT INTO urls(status_id, fullurl, url) VALUES (%s, %s, %s)""",
(status.id_str, fullurl, urlitem['url']))
db.commit()
except Exception, e:
now = datetime.now()
print 'exception [' + now.strftime('%Y-%m-%d %H:%M') + '] ' + str(e)
db.rollback()
db.close()
def on_error(self, status_code):
print 'Error code = %s' % status_code
return True
def on_timeout(self):
print 'timed out.....'
if __name__ == "__main__":
consumer_key = 'NoJ5J9Ky2gCNZSidtYAExg'
consumer_secret = 'Qp98zFS1oveq3QtLr7H4zwB43xpeF9zH4GI5qStDU'
access_key = '449837385-HIhvUsYF5S6Eo3xpR3wE3Kr3bReiPqBJqU7tQ54'
access_secret = '5hJvmy3WRKky91Uv7BpQEOYWBOkjZlEe4LLFAoG0ec'
auth1 = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth1.set_access_token(access_key, access_secret)
api = tweepy.API(auth1)
l = StreamListener()
print 'about to stream'
stream = tweepy.Stream(auth1, l)
setTerms = ['windows8', 'windows 8', 'win 8', 'win8', 'surface tablet', 'microsoft surface','samsung', 'apple', 'kindle', 'galaxy', 'nexus', 'ipad', 'iphone', 'mac', 'osx','iOS']
stream.filter(None,setTerms)
Tweets streamed successfully from the evening of 11/30 to the morning of 12/2
In 36 hours accumulated roughly 1.5MM
Followers count, listed count, retweet count, favorite count, coordinates, created at, status count, friends count, language
1.5 MM compared to 2K
Keywords must be carefully considered and chosen
Ex. Galaxy were playing for MLS Cup,'win 8' very common during football season, etc
All parts of keywords are searched for
Ex. During football season 'win 8' returns a lot of tweets such as, "We are going to win our 8 game this season" or, "Great article, this guy is a winner in the literary field: http://t.co/4567879"
Accuracy depends entirely upon sentiment training set
Can cater training set to your business to be very accurate.
Or could have very inaccurate results
#1 Article on Hacker News yesterday
...even then they'd need a team of good NLP people to make it happen, not me explaining ML to their engineers on the board a few hours a week. Useable fine-grained sentiment analysis is not going to be solved as a side project.
CH does this well, but does not give us access to it. We run into issues such as sure it does not time out, manual upkeep, etc
This can particularly be good for specefic launches or events
Can use fields obtained such as followers count, listed count, retweet count, favorite count, friends count to find both impressions and quality of tweet to find overall reach
Would be difficult but very beneficial if used NLP to find individual tweets gender